Approximate string matching for high-throughput sequencing

نویسنده

  • Enrico Siragusa
چکیده

Over thepast years, high-throughput sequencing (HTS)hasbecomean invaluablemethod of investigation in molecular and medical biology. HTS technologies allow to sequence cheaply and rapidly an individual’s DNA sample under the form of billions of short DNA reads. The ability to assess the content of a DNA sample at base-level resolution opens the way to a myriad of applications, including individual genotyping and assessment of large structural variations, measurement of gene expression levels and characterization of epigenetic features. Nonetheless, the quantity and quality of data produced by HTS instruments call for computationally ef icient and accurate analysis methods. In this thesis, I present novel methods for the mapping of high-throughput sequencing DNA reads, based on state of the art approximate string matching algorithms and data structures. Read mapping is a fundamental step of any HTS data analysis pipeline in resequencing projects, where DNA reads are reassembled by aligning them back to a previously known reference genome. The ingenuity of approximate string matching methods is crucial to design ef icient and accurate read mapping tools. In the irst part of this thesis, I cover practical indexing and iltering methods for exact and approximate stringmatching. I present state of the art algorithms and data structures, give their pseudocode and discuss their implementation. Furthermore, I provide all implementationswithin SeqAn, the generic C++ template library for sequence analysis, which is freely available under http://www.seqan.de/. Subsequently, I experimentally evaluate all implemented methods, with the aim of guiding the engineering of new sequence alignment software. To the best of my knowledge, this is the irst study providing a comprehensive exposition, implementation and evaluation of such methods. In the second part of this thesis, I turn to the engineering and evaluation of readmapping tools. First, I present a novel method to ind all mapping locations per read within a user-de ined error rate; this method is published in the peer-reviewed journal Nucleic Acids Research and packaged in a open source tool nicknamedMasai. Afterwards, I generalize this method to quickly report all co-optimal or suboptimal mapping locations per read within a user-de ined error rate; this method, packaged in a tool called Yara, provides amore practical, yet sound solution to the readmapping problem. Extensive evaluations, both on simulated and real datasets, show that Yara has better speed and accuracy than de-facto standard read mapping tools.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Parallel Algorithm for the Fixed-length Approximate String Matching Problem for High Throughput Sequencing Technologies

The approximate string matching problem is to find all locations at which a query of length m matches a substring of a text of length n with k-or-fewer differences. Nowadays, with the advent of novel high throughput sequencing technologies, the approximate string matching algorithms are used to identify similarities, molecular functions and abnormalities in DNA sequences. We consider a generali...

متن کامل

BlastGraph: intensive approximate pattern matching in string graphs and de-Bruijn graphs

Many de novo assembly tools have been created these last few years to assemble short reads generated by high throughput sequencing platforms. The core of almost all these assemblers is a string graph data structure that links reads together. This motivates our work: BlastGraph, a new algorithm performing intensive approximate string matching between a set of query sequences and a string graph. ...

متن کامل

slaMEM: efficient retrieval of maximal exact matches using a sampled LCP array

MOTIVATION Maximal exact matches, or just MEMs, are a powerful tool in the context of multiple sequence alignment and approximate string matching. The most efficient algorithms to collect them are based on compressed indexes that rely on longest common prefix array-centered data structures. However, their space-efficient representations make use of encoding techniques that are expensive from a ...

متن کامل

Approximate string matching as an algebraic computation

Approximate string matching has a long history and employs a wide variety of methods (see e.g. the survey [2]). We consider a variant of approximate matching that compares a fixed pattern string to every substring in the text string by a rational-weighted edit distance (e.g. the indel distance, defined as the number of character insertions and deletions, or the indelsub/Levenshtein distance, wh...

متن کامل

Parallel Algorithms for Degenerate and Weighted Sequences Derived from High Throughput Sequencing Technologies

Novel high throughput sequencing technologies have redefined the way genome sequencing is performed. They are able to produce millions of short sequences in a single experiment and with a much lower cost than previous methods. In this paper, we address the problem of efficiently mapping and classifying millions of degenerate and weighted sequences to a reference genome, based on whether they oc...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015